Picture for Jifeng Dai

Jifeng Dai

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

Add code
May 07, 2025
Viaarxiv icon

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Add code
Apr 21, 2025
Viaarxiv icon

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Add code
Apr 15, 2025
Viaarxiv icon

LangBridge: Interpreting Image as a Combination of Language Embeddings

Add code
Mar 26, 2025
Viaarxiv icon

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Add code
Mar 25, 2025
Viaarxiv icon

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Add code
Mar 13, 2025
Viaarxiv icon

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Add code
Mar 13, 2025
Viaarxiv icon

MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism

Add code
Mar 03, 2025
Viaarxiv icon

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Add code
Jan 14, 2025
Viaarxiv icon

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Add code
Dec 20, 2024
Viaarxiv icon